The graphs below don’t have proper titles, axis labels,
legends, etc. Please take care to do this on your own graphs.
Throughout this demo we will use the palmerpenguins
dataset. To access the data, you will need to install the
palmerpenguins package:
install.packages("palmerpenguins")
We use geom_point() to generate scatterplots (requires
x and y aes)
library(tidyverse)
library(palmerpenguins)
data(penguins)
penguins %>%
ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point()
Scatterplots display the joint distribution of the variables along
the x and y axes. You should always adjust the
transparency of points via alpha to visualize overlap -
providing a better understanding of joint frequency. This is especially
important with larger datasets.
penguins %>%
ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point(alpha = 0.5)
We can color by a third variable (e.g., different color for each category).
Note that you can put color = inside ggplot
or geom_point, both display the same visualization:
penguins %>%
ggplot(aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
geom_point(alpha = 0.5)
penguins %>%
ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point(alpha = 0.5, aes(color = species))
You can also color by a quantitative variable using a color scale/gradient:
penguins %>%
ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point(alpha = 0.5, aes(color = body_mass_g))
The default color gradient is not the most appealing, while there are a number of possibilities - blue to orange is a good choice since these colors are opposites on the color spectrum:
penguins %>%
ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point(alpha = 0.5, aes(color = body_mass_g)) +
scale_color_gradient(low = "darkblue", high = "darkorange")
size)We can also map variables to other aesthetics,
e.g. size:
penguins %>%
ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point(alpha = 0.5, aes(size = body_mass_g))
shape)… or the type (shape) of points:
penguins %>%
ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point(alpha = 0.5, aes(shape = species))
We can even do several of these at once:
penguins %>%
ggplot(aes(x = bill_depth_mm, y = bill_length_mm,
color = species, shape = island,
size = body_mass_g)) +
geom_point(alpha = 0.5)
The above graph may be a bit difficult to read, but it contains a lot of information in the sense that it is a 5-dimensional graphic:
x = bill depth (mm)y = bill length (mm)color = speciessize = body mass (g)shape = islandBut be careful! The more complications you add, the more difficult your graph is to explain.
We will focus on predicting penguin bill depth
(bill_depth_mm) based on the penguin’s bill length
(bill_length_mm) and species. In particular,
we will consider the following three linear regression models:
bill_depth_mm ~ bill_length_mmbill_depth_mm ~ bill_length_mm +
speciesbill_depth_mm ~ bill_length_mm +
species + bill_length_mm \(\times\) speciesFirst, we can run a simple linear regression (the first model) based
only on bill length. We can display this line via
geom_smooth():
penguins %>%
ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE)
And display the regression model output using
summary():
summary(lm(bill_depth_mm ~ bill_length_mm, data = penguins))
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1381 -1.4263 0.0164 1.3841 4.5255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.88547 0.84388 24.749 < 2e-16 ***
## bill_length_mm -0.08502 0.01907 -4.459 1.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.922 on 340 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.05525, Adjusted R-squared: 0.05247
## F-statistic: 19.88 on 1 and 340 DF, p-value: 1.12e-05
We can write this regression model as:
\[\text{depth} \sim N(\beta_0 + \beta_L \cdot \text{length}, \sigma^2)\]
Note that \(\beta_0\) is the intercept and \(\beta_L\) is the slope.
Thus, our estimates are:
We can also run the second model, which is based on length and species, but with only additive effects. First, we’ll check the counts of the species variable to ensure that the species with the highest number of observations if the reference level (i.e., the first level for a factor variable):
table(penguins$species)
##
## Adelie Chinstrap Gentoo
## 152 68 124
Looks like we’re lucky and that the Adelie species is
the most popular and is already first due to alphabetical order.
What function would we need to do to re-order the variable?
Next, we’ll fit the regression that accounts for species
without an interaction - so it’s just an additive effect:
depth_lm_species_add <- lm(bill_depth_mm ~ bill_length_mm + species,
data = penguins)
summary(depth_lm_species_add)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm + species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4529 -0.6864 -0.0508 0.5519 3.5915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.59218 0.68302 15.508 < 2e-16 ***
## bill_length_mm 0.19989 0.01749 11.427 < 2e-16 ***
## speciesChinstrap -1.93319 0.22416 -8.624 2.55e-16 ***
## speciesGentoo -5.10602 0.19142 -26.674 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9533 on 338 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.769, Adjusted R-squared: 0.7669
## F-statistic: 375.1 on 3 and 338 DF, p-value: < 2.2e-16
This tells us that Chinstrap is different from
Adelie and Gentoo is different from
Adelie, but it does NOT tell us Chinstrap is
different from Gentoo. That would require another model
with a reordered species variable.
Exercise: Reorder species so that Gentoo
is the reference level and compare to the results above.
We can manually extract intercepts and coefficients to use for plotting (read the code comments!):
# Calculate species-specific intercepts in order:
intercepts <- # First for `Adelie` it's just the initial intercept
c(coef(depth_lm_species_add)["(Intercept)"],
# Next for `Chinstrap` it's the intercept plus the `Chinstrap` term:
coef(depth_lm_species_add)["(Intercept)"] +
coef(depth_lm_species_add)["speciesChinstrap"],
# And finally for `Gentoo` it's again the intercept plus the `Gentoo` term
coef(depth_lm_species_add)["(Intercept)"] +
coef(depth_lm_species_add)["speciesGentoo"])
# Create a small table to store the intercept, slopes, and species:
lines_tbl <- tibble("intercepts" = intercepts,
# Slopes are the same for each, thus use rep()
"slopes" = rep(coef(depth_lm_species_add)["bill_length_mm"],
3),
# And the levels of species:
"species" = levels(penguins$species))
We can now plot this model by specifying the regression lines with
geom_abline() using the newly constructed
lines_tbl as the data for this layer:
penguins %>%
ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point(alpha = 0.5) +
geom_abline(data = lines_tbl,
aes(intercept = intercepts, slope = slopes,
color = species)) +
labs(x = "Bill length (mm)", y = "Bill depth (mm)",
title = "Bill depth versus weight by species")
This is a great example of Simpson’s Paradox! We originally observed a negative linear relationship between depth and length, but now observe a positive linear relationship within species!
Next, we can run the third model, which is based on length and
species, including interaction effects. This is the default type of
model displayed when we map species to the color aesthetic
for the geom_smooth() layer. In the plot below, we display
across both layers, geom_point() and
geom_smooth() by mapping species to color in
the initial ggplot canvas construction:
penguins %>%
ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
labs(x = "Bill length (mm)", y = "Bill depth (mm)",
title = "Bill depth versus weight by species")
What about the summary of this model? Is the inclusion of interaction
terms relevant? Note that by default, multiplying two variables in the
lm() formula below includes both the
additive AND interaction terms.
depth_lm_species_int <- lm(bill_depth_mm ~ bill_length_mm * species,
data = penguins)
summary(depth_lm_species_int)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm * species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6574 -0.6675 -0.0524 0.5383 3.5032
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.40912 1.13812 10.025 < 2e-16 ***
## bill_length_mm 0.17883 0.02927 6.110 2.76e-09 ***
## speciesChinstrap -3.83998 2.05398 -1.870 0.062419 .
## speciesGentoo -6.15812 1.75451 -3.510 0.000509 ***
## bill_length_mm:speciesChinstrap 0.04338 0.04558 0.952 0.341895
## bill_length_mm:speciesGentoo 0.02601 0.04054 0.642 0.521590
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9548 on 336 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7697, Adjusted R-squared: 0.7662
## F-statistic: 224.5 on 5 and 336 DF, p-value: < 2.2e-16
The interaction terms do NOT appear to be necessary to include. This is justified by both the lack of significance and the slight drop in adjusted R-squared.
Remember the meaning of the intercept term… that is not reasonable in
this setting because penguins will never have bills with length of 0mm!
We should update the additive model (since we found the interaction
terms to not be meaningful) to remove the intercept. This can be done by
adding a 0 term to the lm() formula:
depth_lm_remove_b0 <- lm(bill_depth_mm ~ 0 + bill_length_mm + species,
data = penguins)
summary(depth_lm_remove_b0)
##
## Call:
## lm(formula = bill_depth_mm ~ 0 + bill_length_mm + species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4529 -0.6864 -0.0508 0.5519 3.5915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## bill_length_mm 0.19989 0.01749 11.427 < 2e-16 ***
## speciesAdelie 10.59218 0.68302 15.508 < 2e-16 ***
## speciesChinstrap 8.65899 0.86207 10.044 < 2e-16 ***
## speciesGentoo 5.48616 0.83547 6.567 1.94e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9533 on 338 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.997, Adjusted R-squared: 0.997
## F-statistic: 2.795e+04 on 4 and 338 DF, p-value: < 2.2e-16
What changed in the summary output? Why did that occur?
We can copy-and-paste our code from above to add these appropriate regression lines:
# Calculate species-specific intercepts in order:
new_intercepts <- # First for `Adelie`
c(coef(depth_lm_remove_b0)["speciesAdelie"],
# Next for `Chinstrap`
coef(depth_lm_remove_b0)["speciesChinstrap"],
# And finally for `Gentoo`
coef(depth_lm_remove_b0)["speciesGentoo"])
# Create a small table to store the intercept, slopes, and species:
new_lines_tbl <-
tibble("intercepts" = new_intercepts,
# Slopes are the same for each, thus use rep()
"slopes" = rep(coef(depth_lm_remove_b0)["bill_length_mm"],
3),
# And the levels of species:
"species" = levels(penguins$species))
Again, create the display:
penguins %>%
ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
geom_point(alpha = 0.5) +
geom_abline(data = new_lines_tbl,
aes(intercept = intercepts, slope = slopes,
color = species)) +
labs(x = "Bill length (mm)", y = "Bill depth (mm)",
title = "Bill depth versus weight by species")
Why is this the same display as before? Here’s a great description of why we observe a higher R-squared with the intercept-term excluded from the model.